Abstract: State-of-the-art techniques for probability sampling 
of users of online social networks (OSNs) are based on random 
walks on a single social relation (typically friendship). While 
powerful, these methods rely on the social graph being fully 
connected. Furthermore, the mixing time of the sampling process 
strongly depends on the characteristics of this graph. 

In this paper, we observe that there often exist other relations 
between OSN users, such as membership in the same group or 
participation in the same event. We propose to exploit the graphs 
these relations induce, by performing a random walk on their 
union multigraph. We design a computationally efficient way to 
perform multigraph sampling by randomly selecting the graph 
on which to walk at each iteration. We demonstrate the benefits 
of our approach through (i) simulation in synthetic graphs, and 
(ii) measurements of Last.fm, an Internet website for music 
with social networking features. More specifically, we show that 
multigraph sampling can obtain a representative sample and 
faster convergence, even when the individual graphs fail, i.e., 
are disconnected or highly clustered. 

Index Terms — Sampling methods, Social network services, 
Last.fm, Random walks, Multigraph, Graph sampling. 

I. Introduction 

The popularity of Online Social Networks (OSNs) has 
skyrocketed within the past decade, with the most popular 
having at present hundreds of millions of users (a number 
that continues to grow apace). This success has inspired 
a number of measurement and characterization studies, as 
well as studies of the interaction between OSN structure and 
systems design, and of user behavior within OSNs. Despite 
their attractions, the large size and access limitations of most 
OSN services (e.g., API query limits, treatment of user data 
as proprietary) make it difficult or impossible to obtain a 
complete census of user accounts and/or topology. Sampling 
methods are thus essential for practical estimation of OSN 
properties. While sampling can, in principle, allow precise 
inference from a relatively small number of observations, this 
depends critically on the ability to draw a sample with known 
statistical properties. The lack of a sampling frame (i.e., a 
complete list of users, from which individuals can be directly 
sampled) for most OSNs makes principled sampling especially 
difficult; recent work in this area has thus focused on sampling 
methods that evade this limitation. 

Key to current sampling schemes is the fact that OSN 
users are, by definition, connected to one another via some 
relation, referred to here as the "social graph." Specifically, 
samples of OSN users can be obtained by crawling the OSN 
social graph, obviating the need for a sampling frame. An 



early family of crawling techniques followed BFS/Snowball- 
type approaches, where nodes of a graph reachable from an 
initial seed are explored exhaustively [1]-[3]. It is now 
well-known that these techniques produce biased samples with poor 
statistical properties when the full graph is not covered [4]-[6]. 
A more recent body of work employs systematic random walks 
on the social graph, and can achieve an asymptotic probability 
sample of users by online or a posteriori correction for the 
(known) bias induced by the crawling process [6,7]. While 
random walk sampling can be very effective, its success is 
ultimately dependent on the connectivity of the underlying 
social graph. More specifically, random walks can yield a 
representative sample of users only if the social graph is fully 
connected. Furthermore, the speed with which the random 
walks converge to the target distribution strongly depends on 
characteristics of the graph, e.g., clustering. 

In this paper, we start from the observation that in OSNs, 
there are often multiple relations connecting the nodes. For 
example, users may be linked not only by direct social ties, 
but also by being members of the same group, participating in 
the same event, or using the same application. Moreover, many 
systems allow all neighbors in such relations to be enumerated 
(either through scraping or API calls). In other words, there 
often exist multiple, crawlable relation graphs — including but 
not limited to ties like "friendship" — defined on the same set 
of nodes. We propose to exploit such multiple-relation (i.e., 
multiplex) graphs by giving a crawler more edges to choose 
from, compared to a crawler restricted to one relation only 
(typically the social graph). For example, we might be able 
to discover users that have no direct social ties, which is 
impossible by crawling the social graph alone. 

There are many ways one can exploit multiplex graphs. 
A naive approach would be to run many crawlers, one on 
each individual relation graph, and then combine the collected 
samples. However, this technique yields biased samples if any 
individual relation graph is fragmented, and fails to exploit 
opportunities for convergence acceleration by mixing across 
relations. A better approach is to combine all individual 
relation graphs into a single union (simple) graph: the resulting 
union graph is frequently connected even if its constituent 
graphs are not. Moreover, the union graph may also be less 
tightly clustered than its constituents, helping a crawler to con- 
verge faster than on the individual graphs. However, walking 
on the union graph requires, at every step, the enumeration of 
all neighbors in all relations, which can be costly in time and 



Fig. 1. Multigraph sampling illustration. Panels: (a) Friendship graph, (b) Group graph, (c) Event graph, (d) Union simple graph, (e) Union multigraph, which contains all edges in the simple graphs, (f) An equivalent way of thinking of the multigraph as a "mixture" of simple graphs. 
(a-c) Graphs for three different relations $G_i$: Friendship, Group and Event. (d) Union (simple) graph, as presented in Definition 1. (e) Union multigraph, as presented in Definition 2. Node A has degrees $d_1(A) = 3$, $d_2(A) = 2$ and $d_3(A) = 2$ in the Friendship, Group and Event graphs, respectively. Its total degree in the union multigraph is $d(A) = 7$. (f) An alternative view of the union multigraph. The neighbor selection Algorithm 1 first selects a graph $G_i$ with probability $d_i(A)/d(A)$. Next, it picks a random neighbor in the selected graph $G_i$, i.e., with probability $1/d_i(A)$. 



bandwidth. Instead we propose a third, cost-efficient approach. 

We propose a novel two-stage algorithm that walks on the 
union multigraph. Our multigraph sampling first selects the 
relation on which to walk and then enumerates the neighbors 
with regard to that relation only, which makes it, in practice, 
even more efficient than union graph sampling. We prove that 
this algorithm achieves convergence to the proper equilibrium 
distribution when the union multigraph is connected. We 
also demonstrate the benefits of multigraph sampling in two 
settings: (i) by simulation of synthetic random graphs; and 
(ii) by measurements of Last.fm, an Internet website for 
music with social networking features. We chose Last.fm as 
an example of a network that is highly fragmented with respect 
to the social graph as well as other relations. We show that 
multigraph sampling can obtain a representative sample when 
each individual graph is disconnected. Along the way, we 
also give practical guidelines on how to efficiently implement 
multigraph sampling for OSNs more generally. 

The structure of the rest of the paper is as follows. Section II 
describes our sampling methodology. Section III evaluates 
our methodology on synthetic graphs. Section IV applies 
our methodology to sample Last.fm and provides practical 
recommendations. Section V discusses related work. Finally, 
Section VI concludes the paper. 

II. Sampling Methodology 

A. Terminology and Definitions 

We consider different sets of edges $\mathcal{E} = \{E_1, \ldots, E_Q\}$ on a 
common set of users $V$. Each $E_i$ captures a symmetric relation 
between users, such as friendship or group co-membership. 
$(V, E_i)$ thus defines an undirected graph $G_i$ on $V$. We make 
no assumptions of connectivity or other special properties of 
each $G_i$. Fig. 1(a-c) shows an example of $Q = 3$ different 
relations and relation graphs $G_i$ defined on the same 5 nodes. 
Fig. 2(a-e) shows $Q = 5$ such graphs defined on a set of 50 
nodes. 

Consider a set of graphs $G_i = (V, E_i)$, $i = 1, \ldots, Q$, defined 
on a common node set $V$. $\mathcal{E}$ can be used to construct several 
types of combined structures on $V$. We will employ the 
following two such structures: 

Definition 1: The union (simple) graph $G' = (V, E')$ of 
$G_1, \ldots, G_Q$ is defined as the graph on $V$, whose edges are 
given by the set $E' = \bigcup_{i=1}^{Q} E_i$. ■ 

Definition 2: The union multigraph $G = (V, E)$ of 
$G_1, \ldots, G_Q$ is defined as the multigraph on $V$, whose edges 
are given by the multiset $E = \biguplus_{i=1}^{Q} E_i$. ■ 

Note that the union multigraph $G$ can contain multiple edges 
between a pair of nodes, while the union graph $G'$ contains 
only one (or no) edge. Every multigraph $G$ can be reduced 
to the union graph $G'$ by merging together multiple edges 
between two nodes into one. Our focus, in this paper, is on the 
union multigraph, also referred to as simply the multigraph, 
because it allows us to more efficiently implement sampling 
on multiple relations. However, we also use the union graph 
as a helpful conceptual tool. 

B. Some False Starts 

We seek to draw a sample of the nodes in V, so that 
the draws are (at least approximately) independent and the 
sampling probability of each node is known up to a constant 
of proportionality. There are several ways to achieve this goal 
using multiple graphs. We discuss some of them below. 

1) Naive Multiple Graph Sampling: A naive way is to run 
many random walks, one on each individual graph $G_i$, and 
to combine the collected samples. However, if a particular $G_i$ 
is disconnected (as are all five graphs in Fig. 2(a-e)), a walk 
on $G_i$ is restricted to the connected component around its 
starting point and thus never converges to the desired target 
distribution. This results in an asymptotically biased sample, 
dependent on the initial seeds. 

Fig. 2. Example of multiple graphs vs. their union. Five draws from a 
random $(N, p)$ graph with $N = 50$ and expected degree 1.5 are depicted 
in (a)-(e). Each simple graph is disconnected, while their union graph $G'$, 
depicted in (f), is fully connected. The union multigraph $G$ (not shown here) 
is also connected: it has (possibly multiple) edges between the exact same 
pairs of nodes as $G'$. 

2) Union Simple Graph Sampling: A much better approach 
is to perform a random walk on the union graph $G'$. We show 
examples of union graphs in Fig. 1(d) and Fig. 2(f). Note that 
although in both cases the individual graphs are disconnected, 
their union graphs are well-connected, which allows for a 
quick convergence of the random walk. 

A potential practical difficulty with a random walk on 
the union graph $G'$ is that, at each step, computing the 
neighborhood union can be quite expensive: it requires the 
enumeration of all edges adjacent to the current vertex $v$, in 
each relation graph $G_i$. This may be very costly, depending on 
$v$'s neighborhood size (which can be large in heavily clustered 
relations, such as group co-membership), query costs, and the 
number of relations $Q$. 

C. Union Multigraph Sampling 

We can address the enumeration problem of the union 
graph $G'$ by considering the union multigraph; see Definition 2 
and the example illustrated in Fig. 1(e). We employ a random 
walk that moves from one vertex to another by selecting 
random edges on the multigraph. A naive implementation 
of such a random walk still requires the enumeration of 
all neighbors of the current node $v$. Instead, we propose 
to use the two-stage neighbor-selection procedure 
described in Algorithm 1 and depicted in Fig. 1(f), which 
requires enumeration of $v$'s neighborhood for only a single 
graph. 

Denote by $d_i(v)$ the degree of node $v$ in graph $G_i$, and by 
$d(v) = \sum_{i=1}^{Q} d_i(v)$ its total degree in the union multigraph. 



Algorithm 1 Multigraph Sampling Algorithm 
Require: $v_0 \in V$, simple graphs $G_i$, $i = 1, \ldots, Q$ 

1: Initialize $v \leftarrow v_0$. 
2: while not Converged do 
3:     Select graph $G_i$ with probability $d_i(v)/d(v)$ 
4:     Select uniformly at random a neighbor $v'$ of $v$ in $G_i$ 
5:     $v \leftarrow v'$ 
6: end while 
7: return all sampled nodes $v$ and their degrees $d(v)$. 



First, we select a graph $G_i$ with probability $d_i(v)/d(v)$. Second, 
we pick uniformly at random an edge of $v$ within the selected 
$G_i$ (i.e., with probability $1/d_i(v)$), and we follow this edge to 
$v$'s neighbor. This procedure is equivalent to selecting an edge 
of $v$ uniformly at random in the union multigraph, because 

$$\frac{d_i(v)}{d(v)} \cdot \frac{1}{d_i(v)} = \frac{1}{d(v)}.$$ 

Note that in Step 3, Algorithm 1 requires only the values 
of the degrees $d_i(v)$ in all relation graphs. Only in Step 4 of 
Algorithm 1 does one enumerate all neighboring edges in the 
selected $G_i$. Because the degree information $d_i(v)$ is usually 
much cheaper to obtain (e.g., via simple low-bandwidth API 
calls) than enumerating all $d_i(v)$ edges, Algorithm 1 has the 
potential to save much bandwidth compared to union 
simple graph sampling (which enumerates all neighboring 
edges in all relations). This benefit is amplified when higher 
numbers of relations ($Q$) are used. Algorithm 1 may also be 
helpful in certain offline applications involving surveys and 
human respondents (e.g., RDS [8]), in which selection of 
random neighbors is possible but enumeration is not. 
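The two-stage selection of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: each relation graph is held in memory as an adjacency dictionary (in a real crawler, the degrees and neighbor lists would come from API calls or scraping), the function names `multigraph_step` and `multigraph_walk` are hypothetical, and a fixed number of steps stands in for the "while not Converged" test.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

# Assumption: each relation graph G_i is an adjacency map node -> neighbor list.
Graph = Dict[Hashable, List[Hashable]]


def multigraph_step(v, graphs: Sequence[Graph], rng: random.Random):
    """One step of the two-stage neighbor selection from node v."""
    # Stage 1: only the degrees d_i(v) are needed to choose a relation.
    degrees = [len(g.get(v, [])) for g in graphs]
    if sum(degrees) == 0:
        raise ValueError(f"node {v!r} is isolated in the union multigraph")
    # Select graph G_i with probability d_i(v) / d(v).
    i = rng.choices(range(len(graphs)), weights=degrees, k=1)[0]
    # Stage 2: enumerate neighbors in the selected graph G_i only.
    return rng.choice(graphs[i][v])


def multigraph_walk(v0, graphs: Sequence[Graph], num_steps: int,
                    seed: int = 0) -> List[Tuple[Hashable, int]]:
    """Return the visited nodes together with their total degrees d(v)."""
    rng = random.Random(seed)
    v = v0
    samples = []
    for _ in range(num_steps):  # fixed length replaces the convergence test
        v = multigraph_step(v, graphs, rng)
        d_v = sum(len(g.get(v, [])) for g in graphs)
        samples.append((v, d_v))
    return samples


# Toy example: the friendship graph is disconnected ({A,B} and {C,D}),
# but a shared group links B and C, so the union multigraph is connected.
friends = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]}
groups = {"B": ["C"], "C": ["B"]}
walk = multigraph_walk("A", [friends, groups], num_steps=200)
```

In the toy example, a walk restricted to `friends` could never leave {A, B}, whereas the multigraph walk can reach C and D through the group edge.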

Algorithm 1 leads to the following equilibrium distribution: 

Proposition 2.1: If $G$ is connected and contains at least one 
triangle, then Algorithm 1 leads to equilibrium distribution 

$$\pi(v) = \frac{d(v)}{\sum_{u \in V} d(u)}.$$ 



Proof: Let $d(v,u) = d(u,v)$ be the number of edges 
between nodes $v$ and $u$ in the union multigraph $G$. The 
sampling process of Algorithm 1 is a Markov chain on $V$ 
with transition probabilities $P_{vu} = d(v,u)/d(v)$, $u, v \in V$. So long 
as $G$ is finite and connected, this random walk is irreducible 
and positive recurrent. The presence of a triangle within $G$ 
further guarantees aperiodicity. 

A Markov chain of this type is equivalent to a random walk 
on an undirected weighted graph with edge weights $w(v,u) = 
d(v,u)$. A random walk on a weighted graph is known to have 
the unique equilibrium distribution $\pi(v) = w(v) / \sum_{u \in V} w(u)$, where 
$w(v) = \sum_u w(v,u)$ (e.g., see [9, Example 4.32]), and the 
proof follows immediately by substitution. ■ 
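As a sanity check of Proposition 2.1, the stationarity condition $\pi(u) = \sum_v \pi(v) P_{vu}$ can be verified exactly on a small multigraph. The sketch below is not part of the original analysis; the three-node example and the helper names `transition_matrix` and `is_stationary` are hypothetical, and exact rational arithmetic is used so that equality can be tested without rounding.

```python
from collections import Counter
from fractions import Fraction


def transition_matrix(nodes, multi_edges):
    """Return P[v][u] = d(v,u) / d(v) and the degrees d(v) of a multigraph."""
    mult = Counter()
    for a, b in multi_edges:  # undirected: count each edge in both directions
        mult[(a, b)] += 1
        mult[(b, a)] += 1
    deg = {v: sum(mult[(v, u)] for u in nodes) for v in nodes}
    P = {v: {u: Fraction(mult[(v, u)], deg[v]) for u in nodes} for v in nodes}
    return P, deg


def is_stationary(pi, P, nodes):
    """Check pi(u) == sum_v pi(v) * P[v][u] exactly for every node u."""
    return all(pi[u] == sum(pi[v] * P[v][u] for v in nodes) for u in nodes)


nodes = ["A", "B", "C"]
# A triangle in which A and B are linked by two relations (a double edge).
multi_edges = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
P, deg = transition_matrix(nodes, multi_edges)
total_degree = sum(deg.values())
# Candidate equilibrium distribution from Proposition 2.1.
pi = {v: Fraction(deg[v], total_degree) for v in nodes}
print(is_stationary(pi, P, nodes))
```

Here $d(A) = d(B) = 3$ and $d(C) = 2$, so the candidate distribution is $(3/8, 3/8, 2/8)$.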

D. Practical Issues 

Various practical issues need to be addressed when 
implementing these ideas. For completeness, we briefly 
repeat some good practices here, and we refer the interested 
reader to our parallel work [6,10]. 

1) Choice of crawling technique: There are many ways to 
crawl a multigraph, e.g., by using various random walks or 
graph traversal techniques. In [10], we showed that a simple 
random walk with correction for unequal sample weights, also 
called Re-Weighted Random Walk (RWRW), is more efficient 
than competitors such as Metropolis-Hastings Random Walks. 
Therefore, throughout this paper, we employ RWRW, described 
below. 

2) Re-Weighted Random Walk: Much like a classic random 
walk on a simple graph, a random walk on the multigraph $G$ 
is inherently biased towards high-degree nodes. Indeed, per 
Proposition 2.1, the probability of sampling a node $v$ is 
proportional to its degree $d(v)$ in $G$. References [11]-[13] show how 
to apply the Hansen-Hurwitz estimator [14] to correct for 
this bias. Let $x(v)$ be an arbitrary function defined on graph 
nodes $V$, with mean $\bar{x} = \frac{1}{|V|} \sum_{v \in V} x(v)$, and let $S$ 
denote the collected sample of nodes. Then 

$$\hat{x} = \frac{\sum_{v \in S} x(v)/d(v)}{\sum_{v \in S} 1/d(v)} \qquad (1)$$ 

is an unbiased and consistent estimator of $\bar{x}$. By default, we 
use this reweighting procedure throughout the paper (referring 
to the combination of random walks with post-hoc reweighting 
as the RWRW method). 
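The re-weighting step of RWRW amounts to a weighted mean in which every sampled node is weighted by the inverse of its multigraph degree. A minimal sketch follows, assuming the sample is available as a list of (node, degree) pairs; the function name `rwrw_estimate` and the toy data are hypothetical.

```python
def rwrw_estimate(samples, x):
    """Re-weighted mean of x over a degree-biased random-walk sample.

    samples: iterable of (node, d(node)) pairs, repeats allowed.
    x: function mapping a node to a numeric value.
    """
    numerator = 0.0
    denominator = 0.0
    for v, d_v in samples:
        numerator += x(v) / d_v    # sum of x(v) / d(v)
        denominator += 1.0 / d_v   # sum of 1 / d(v)
    if denominator == 0.0:
        raise ValueError("empty sample")
    return numerator / denominator


# Toy population: degrees A=1, B=2, C=1 and values x(A)=1, x(B)=0, x(C)=1,
# so the true mean is 2/3. A degree-proportional sample visits B twice.
values = {"A": 1.0, "B": 0.0, "C": 1.0}
samples = [("A", 1), ("B", 2), ("B", 2), ("C", 1)]
naive_mean = sum(values[v] for v, _ in samples) / len(samples)
corrected_mean = rwrw_estimate(samples, values.get)
```

In this toy case the uncorrected sample mean is 0.5 because the high-degree node B is over-represented, while the re-weighted estimate recovers 2/3.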

3) Multiple Walks and Convergence Diagnostics: In 
previous work [6,10], we recommended the use of multiple, 
simultaneous random walks to reduce the chance of obtaining 
samples that overweight non-representative regions of the 
graph. We also recommended the use of formal convergence 
diagnostics to assess sample quality in an online fashion, 
which help to determine when a set of walks is in approximate 
equilibrium, and hence when it is safe to stop sampling. Use of 
both multiple walks and convergence diagnostics is critical to 
effective sampling of OSNs, as our sample case (Section IV) 
illustrates. 

In this paper, we use three convergence diagnostics, 
following [6,10]. First, we track the running means for various 
scalar parameters of interest as a function of the number of 
iterations. Second, we use the Geweke [15] diagnostic within 
each random walk, which verifies that the mean value of a scalar 
parameter at the beginning of the walk (here the first 10% of 
samples) does not differ significantly from the corresponding 
mean at the end of the walk (here the last 50%). Third, we 
use the Gelman-Rubin [16] diagnostic to verify convergence 
across walks, by ensuring that the parameter variance between 
walks matches the variance within walks. 
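The two formal diagnostics can be sketched as follows. This is a simplified illustration, not the procedure of [6,10]: the Geweke z-score below uses plain sample variances (the full diagnostic estimates variances from the spectral density to account for autocorrelation), the Gelman-Rubin statistic assumes walks of equal length, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke z-score comparing the first 10% and last 50%."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[n - int(last * n):]
    # Assumption: naive variances, ignoring autocorrelation within the walk.
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def gelman_rubin(chains):
    """Potential scale reduction factor for equally long walks."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)         # W
    between = n * variance([mean(c) for c in chains])  # B
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Two walks fluctuating around the same value, and one shifted walk.
walk_a = [0.0, 1.0] * 50
walk_b = [1.0, 0.0] * 50
walk_shifted = [10.0, 11.0] * 50
r_ok = gelman_rubin([walk_a, walk_b])         # close to 1
r_bad = gelman_rubin([walk_a, walk_shifted])  # much larger than 1
z_ok = geweke_z(walk_a)                       # near 0
z_drift = geweke_z([float(i) for i in range(100)])  # large in magnitude
```

Values of the Gelman-Rubin statistic close to 1 and Geweke z-scores of small magnitude are consistent with approximate equilibrium; the specific thresholds to apply are a design choice left to the analyst.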

III. Evaluation in Synthetic Graphs 

In this section, we use synthetic graphs to demonstrate 
two key benefits of the multigraph approach, namely (i) 
improved connectivity of the union multigraph, even when 
the underlying individual graphs are disconnected, and (ii) 
improved mixing time, even when the individual graphs are 
highly clustered. The former is necessary for the random walk 
to converge. The latter determines the speed of convergence. 

Erdős-Rényi graphs. In Section II-B2 and Fig. 2, we 
noted that even sparse, highly fragmented graphs can have 
well-connected unions. In Fig. 3, we generalize this example 
and quantify the benefit of the multigraph approach. We 
consider here a collection $G_1, \ldots, G_Q$ of $Q$ Erdős-Rényi random 
graphs $(N, p)$ with $N = 1000$ nodes and $p = 1/1000$, i.e., with 
the expected number of edges $|E| = 500$ each. We then look 
at properties of their multigraph $G$ with increasing numbers 
of simple graphs $Q$. 

Fig. 3. Multigraph that combines several Erdős-Rényi graphs. We 
generate a collection $G_1, \ldots, G_Q$ of $Q$ random Erdős-Rényi (ER) graphs with 
$|V| = 1000$ nodes and expected $|E| = 500$ edges each. (top) We show 
two properties of multigraph $G$ as a function of $Q$. (1) Largest Connected 
Component (LCC) fraction ($f_{LCC}$) is the fraction of nodes that belong to 
the largest connected component in $G$. (2) The second eigenvalue of the 
transition matrix of the random walk on the LCC is related to the mixing time. 
(bottom) We also label a fraction $f = 0.5$ of nodes within the LCC and run 
random walks of lengths 20...500 to estimate $f$. We show the estimation 
error (measured as the standard deviation) as a function of $Q$ (x axis) and 
walk length (different curves). 

In order to characterize the connectivity of $G$, we define 
$f_{LCC}$ as the fraction of nodes that belong to the largest 
connected component in $G$. For $Q = 1$ we have $f_{LCC} \approx 0.15$, 
which means that each simple ER graph is heavily fragmented. 
Indeed, at least 999 edges are necessary for connectivity. 
However, as $Q$ increases, $f_{LCC}$ increases. With a relatively small 
number of simple graphs, say for $Q = 6$, we get $f_{LCC} \approx 1$, 
which means that the multigraph is fully connected with high 
probability. In other words, combining several simple graphs 
into a multigraph allows us to reach (and sample) many nodes 
otherwise unreachable. 
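The effect of $Q$ on the largest connected component can be reproduced qualitatively with a short simulation. The sketch below is an assumption-laden illustration rather than the code behind Fig. 3: it draws $Q$ independent $(N, p)$ graphs, merges their edges with a union-find structure, and reports the LCC fraction; the function name `lcc_fraction_of_union` is hypothetical and results vary with the random seed.

```python
import random


def lcc_fraction_of_union(n, p, q, seed=0):
    """LCC fraction of the union of q independent random (n, p) graphs."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _ in range(q):                 # one pass per relation graph G_i
        for u in range(n):
            for v in range(u + 1, n):
                if rng.random() < p:   # edge {u, v} is present in G_i
                    ru, rv = find(u), find(v)
                    if ru != rv:
                        parent[ru] = rv

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n
```

For example, `lcc_fraction_of_union(1000, 1/1000, 1)` and `lcc_fraction_of_union(1000, 1/1000, 6)` correspond to the $Q = 1$ and $Q = 6$ settings discussed above. Multiple edges between the same pair of nodes do not affect connectivity, so the same computation applies to the union graph and the union multigraph.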

Note that this example illustrates a more general 
phenomenon. Given $Q$ independent random graphs with $N$ nodes 
each and with expected densities $p_1, \ldots, p_Q$, the probability 
that an edge $\{u, v\}$ belongs to their union graph $G'$ is 
$p^* = 1 - \prod_{i=1}^{Q} (1 - p_i)$. For $p_i$ approximately equal, this 
approaches 1 exponentially fast in $Q$. Asymptotically, the union 
graph will be almost surely connected where $(N-1)p^* > \ln N$ 
[17, pp. 413-417], in which case the union multigraph is also 
trivially connected. Thus, intuitively, a relatively small number 
of sparse graphs is needed for the union to exceed its 
connectivity threshold. 
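The threshold condition can be evaluated directly. The following sketch computes $p^*$ for given densities and finds the smallest $Q$ for which $(N-1)p^* > \ln N$ when all graphs share the same density $p$; the function names are hypothetical and the condition is asymptotic, so the result should be read as a rough guide only.

```python
import math


def union_edge_probability(densities):
    """p* = 1 - prod_i (1 - p_i) for independent random graphs."""
    product = 1.0
    for p in densities:
        product *= 1.0 - p
    return 1.0 - product


def min_graphs_for_connectivity(n, p, max_q=1000):
    """Smallest Q with (n - 1) * p* > ln(n), or None if not reached."""
    for q in range(1, max_q + 1):
        p_star = union_edge_probability([p] * q)
        if (n - 1) * p_star > math.log(n):
            return q
    return None


q_needed = min_graphs_for_connectivity(1000, 1 / 1000)
```

For $N = 1000$ and $p = 1/1000$ the condition is first met at $Q = 7$, which is in line with the simulation observation that $f_{LCC}$ is already close to 1 around $Q = 6$.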

In order to characterize the mixing time, we plot the second 
eigenvalue $\lambda_2$ of the transition matrix of the random walk (on 
the LCC). $\lambda_2$ is well-known to relate to the mixing time of the 
associated Markov chain [18]: the smaller $\lambda_2$, the faster the 
convergence. In Fig. 3 (top), we observe that $\lambda_2$ drops 
significantly with growing $Q$. (However, note that adding a new edge 
to an existing graph does not always guarantee a decrease 
of $\lambda_2$. It is possible to design examples where $\lambda_2$ increases, 
although they are rare.) 

To further illustrate the connection between $\lambda_2$ and the 
speed of convergence, we conducted an experiment with a 
simple practical goal: apply a random walk to estimate the size 
of an exogenously defined "community" in the network. We 
labeled as "community members" a fraction $f = 0.5$ of nodes 
within the LCC (these nodes were selected with the help of a 
randomly initiated BFS to better imitate a community). Next, 
we ran 100 random walks of lengths 20...500 within this 
LCC, and we used them to estimate $f$. In Fig. 3 (bottom), 
we show the standard error of this estimator, as a function 
of $Q$ (x axis) and walk length (different curves). This error 
decreases not only with the walk length, but also with the 
number $Q$ of combined graphs. This means that by using 
the multigraph sampling approach we improve the quality of 
our estimates. Alternatively, we may think of it as a way to 
decrease the sampling cost. For example, in Fig. 3, a random 
walk of length 500 for $Q = 3$ (i.e., when the LCC fraction is already close 
to 1) is equivalent to a walk of length 100 for $Q \approx 8$, which 
results in a five-fold reduction of the sampling cost. 

ER Graph Plus Random Cliques. One may argue that ER 
graphs are not good models for capturing real-life relations. 
Indeed, in practice, many relations are highly clustered; e.g., 
a friend of my friend is likely to be my friend. In an extreme 
case, all members of some community may form a clique. 
This is quite common in OSNs, where we are often able to 
browse all members of a group, or all participants of an event. 

Interestingly, the multigraph technique is also efficient in 
the presence of cliques. In Fig. 4(a), we consider one ER 
graph, combined with an increasing number of random cliques. 
We plot the same three metrics as in Fig. 3 and we obtain 
qualitatively similar results. This robustness is a benefit of the 
multigraph approach. 

Random Graphs with Clustering. Finally, in Fig. 4(b), we 
consider a combination of random graphs with clustering [19]. 
The results confirm our previous observations. 

IV. Multigraph Sampling of Last.fm 

In this section, we apply multigraph sampling to Last.fm, 
a music-oriented OSN that allows users to create communities 
of interest that include both listeners and artists. Last.fm 
is built around an Internet radio service that compiles a 
preference profile for each listener and recommends users 
with similar tastes. In June 2010, Last.fm was reported to 
have around 30 million users and was ranked in the top 400 
websites in Alexa. We chose Last.fm to demonstrate our 
approach because it provides an example of a popular OSN 
that is fragmented with respect to the social graph (referred 
to on the site as "friendship") as well as other relations. For 
example, many Last.fm users mainly listen to music and do 
not use the social networking features, which makes it difficult 
to reach them through crawling the friendship graph; likewise, 
users with similar music tastes may form clusters that are 
disconnected from other users with very similar music tastes. 
This intuition was confirmed by our empirical observations. 
Despite these challenges, we show that multigraph sampling 
is able to obtain a fairly representative sample in this case, 
while single graph sampling on any specific relation fails. 

(a) Combination of one ER graph ($|V| = 200$ nodes and $|E| = 100$ 
edges) with a set of $k - 1$ cliques (of size 40 randomly chosen nodes 
each). 
(b) Combination of multiple regular random graphs with 
clustering [19]. We set the parameters such that each of $|V| = 1000$ nodes 
has degree equal to 2, and each edge participates in exactly one triangle. 

Fig. 4. Multigraphs resulting from a combination of various graphs. 

A. Crawling Last.fm 

We sample Last.fm via random walks on several individual 
relations as well as on their union multigraph. Fig. 5 shows 
the information collected for each sampled user. 

1) Walking on Relations: We consider the following rela- 
tions between two users: 

• Friends: This refers to mutually declared friendship 
between two users. 

• Groups: Users with something in common are allowed 
to start a group. Membership in the same group connects 
all involved users. 

• Events: Last.fm allows users to post information on 
concerts or festivals. Attendees can declare their intention 
to participate. Attendance in the same event connects all 
involved users. 

• Neighbors: Last.fm matches each user with up to 
50 similar neighbors based on common activity, 
membership, and taste. The details of neighbor selection 
are proprietary. We symmetrize this directed relation by 
considering only mutual neighbors as adjacent. 

Fig. 5. Information collected for a sampled user u. (a) userName and 
user.getInfo: Each user is uniquely identified by her userName. The API call 
user.getInfo returns: realName, userID, country, age, gender, subscriber, 
playcount, number of playlists, bootstrap, thumbnail, and user registration 
time. (b) Friends list: List of mutually declared friendships. (c) Event list: 
List of past and future events that the user indicates she will attend. We store 
the eventID and number of attendees. (d) Group list: List of groups of which 
the user is a member. We store the group name and group size. (e) Symmetric 
neighbors: List of mutual neighbors. 

First, we collect a sample of users by a random walk on 
the graph for each individual relation, that is Friends, Groups, 
Events, and Neighbors. Then, we consider sets of relations 
(namely: Friends-Events, Friends-Events-Groups, and Friends- 
Events-Groups-Neighbors) and we perform a random walk 
on the corresponding union multigraph. In the rest of the 
section, we refer to random walks on different simple graphs 
or multigraphs as crawl types. 

2) Uniform Sample of userIDs (UNI): Last.fm 
usernames uniquely identify users in the API and HTML 
interface. However, internally, Last.fm associates each 
username with a userID, presumably used to store user 
information in the internal database. We discovered that 
it is possible to obtain usernames from their userIDs, a 
fact that allowed us to obtain a uniform, "ground truth" 
sample of the user population. Examination of registration 
and ID information indicates that Last.fm allocates 
userIDs in increasing order. Fig. 6 shows the exact 
registration date and the assigned userID for each sampled 
user in our crawls obtained through exploration. With 
the exception of the first ~2M users (registered in 
the first 2 years of the service), for every $userID_1 > 
userID_2$ we have $registration\_time(userID_1) > 
registration\_time(userID_2)$. We also believe that userIDs 
are assigned sequentially because we rarely observe 
non-existent userIDs after the ~2,000,000 threshold. We 
conjecture that the few non-existent userIDs after this 
threshold are closed or banned accounts. At the beginning of 
the crawl, we found no indication of user accounts with IDs 
above ~31,200,000. Just before the crawls, we registered 
new users that were assigned userIDs slightly higher than 
the latter value. 




Fig. 6. Last.fm assigns userIDs in increasing order after 2005: userID vs. 
user registration date. 



Using the userID mechanism, we obtained a reference 
sample of Last.fm users by uniform rejection sampling [20]. 
Specifically, each user was sampled by repeatedly drawing 
uniform integers between 0 and 35 million (i.e., the maximum 
observed ID plus a ~4 million "safety" range) and querying 
the userID space. Integers not corresponding to a valid userID 
were discarded, with the process being repeated until a match 
was obtained. IDs obtained in this way are uniformly sampled 
from the space of user accounts, irrespective of how IDs are 
actually allocated within the address space [21]. We employ 
this procedure to obtain a sample of 500K users, referred 
to here as "UNI." We note that the same method has been 
recently used in [6], as well as in [22]. The latter also examined 
population growth and active vs. inactive users, which are out 
of the scope of this paper. 
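The rejection sampling procedure can be sketched as below. The sketch assumes a caller-supplied predicate `is_valid_id` that stands in for the query against the userID space (it is hypothetical and does not correspond to a specific Last.fm API call), and it samples with replacement.

```python
import random


def uniform_user_sample(is_valid_id, max_id, sample_size, seed=0):
    """Draw userIDs uniformly at random from the valid IDs in [0, max_id].

    is_valid_id: hypothetical callback returning True if the ID exists.
    Note: the loop does not terminate if no ID in the range is valid.
    """
    rng = random.Random(seed)
    sample = []
    while len(sample) < sample_size:
        candidate = rng.randint(0, max_id)  # uniform over the ID range
        if is_valid_id(candidate):          # reject IDs with no account
            sample.append(candidate)
    return sample


# Toy ID space in which only every 7th integer is an existing account.
existing_ids = set(range(0, 1000, 7))
toy_sample = uniform_user_sample(existing_ids.__contains__, 999, 200)
```

Because every valid ID is accepted with the same probability, the accepted IDs are uniform over existing accounts regardless of where the gaps lie; the price is that the expected number of queries per accepted ID grows as the ID space becomes sparser.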

Although UNI sampling currently solves the problem of 
uniform node sampling in Last.fm and is a valuable asset for 
this study, it is not a general solution for sampling OSNs. Such 
an operation is not generally supported by OSNs. Furthermore, 
the userID space must not be sparse for this operation to be 
efficient. In the Last.fm case, the small userID space makes 
this possible at the time of this writing; however, a simple 
increase of the userID space to 48 or 64 bits would render 
the technique infeasible. In summary, we were able to obtain 
a uniform sample of userIDs and use it as a baseline for 
evaluating the sampling methods of interest against the target 
distribution. 

3) Estimating Last.fm population size: In addition to the 
UNI sample presented in Table I, we obtained a second UNI 
sample of the same size one week later. We then applied the 
capture-recapture method [23] to estimate the Last.fm user 
population during the period of our crawling. According to 
this method, the population size is estimated to be: 

$$P_{Last.fm} = \frac{N_{UNI1} \times N_{UNI2}}{R} = 28.5M,$$ 

where $N_{UNI1} = N_{UNI2} = 500K$ and $R$ is the number of 
valid common userIDs sampled during the first and second 
UNI samples. This estimation is consistent with our 
observations of the maximum userID space and close to the reported 
size of Last.fm on various Internet websites. We will later 
use this second sample to comment on the topology change 
TABLE I 
Summary of collected datasets in July 2010. The percentage of users kept is determined from convergence diagnostics. Averages 
shown are after convergence and re-weighting which corrects sampling bias. 

Crawl type                       # Total  % Unique  # Users  Crawling     Avg #    Avg #   Avg # events 
                                 users    users     kept     period       friends  groups  (past/future) 
-------------------------------  -------  --------  -------  -----------  -------  ------  ------------- 
Friends                          5x50K    71.0%     245K     07/13-07/16  10.7     2.40    2.44/0.17 
Events                           5x50K    58.5%     245K     07/13-07/18  18.0     4.71    7.49/0.56 
Groups                           5x50K    74.3%     245K     07/13-07/17  15.8     5.22    3.96/0.28 
Neighbors                        5x50K    53.1%     245K     07/13-07/17  12.2     2.90    2.94/0.27 
Friends-Events                   5x50K    59.4%     200K     07/13-07/18  9.8      2.47    2.30/0.17 
Friends-Events-Groups            5x50K    75.5%     187K     07/13-07/18  6.8      0.71    0.74/0.05 
Friends-Events-Groups-Neighbors  5x50K    75.6%     200K     07/13-07/21  6.6      0.67    0.73/0.04 
UNI                              500K     99.1%     500K     07/13-07/16  1.2      0.30    0.28/0.02 




during our crawls. 
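The capture-recapture estimate is a one-line computation once the overlap between the two UNI samples is known. In the sketch below the function name is hypothetical, and the overlap value is not reported in the text: it is back-computed here from the stated estimate of 28.5M purely for illustration.

```python
def capture_recapture(n_first, n_second, n_common):
    """Population estimate N1 * N2 / R from two independent uniform samples."""
    if n_common <= 0:
        raise ValueError("the two samples must share at least one userID")
    return n_first * n_second / n_common


# Assumption: an overlap of about 8,772 userIDs, implied by
# 500K * 500K / 28.5M; this figure is illustrative, not a reported value.
estimate = capture_recapture(500_000, 500_000, 8_772)
```

The estimate is sensitive to the overlap count: with samples of this size, each additional common userID lowers the estimate by roughly 3,000 users.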

4) Topology Change During Sampling Process: While substantial 
change could in theory affect the estimation process, 
Last.fm evolves very little during our crawls. 
The increasing-order userID assignment allows us to infer the 
maximum growth of Last.fm during this period, which we 
estimate at 25K/day on average. Therefore, with a population 
increase of 0.09%/day, the user growth during our crawls 
(2-7 days) is calculated to range between 0.18% and 0.63% per 
crawl type, which is quite small. Furthermore, the comparison 
between the two UNI samples revealed almost identical 
distributions for the properties studied here, as shown in Fig. 7. 
Therefore, in the rest of the paper, we assume that any changes 
in the Last.fm network during the crawling period can be 
ignored. This is unlike the context of dynamic graphs, where 
considering the dynamics is essential, e.g., see [7,24,25]. 
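
The growth figures above follow from simple arithmetic, sketched here for transparency. The helper names are illustrative; the inputs (25K new users per day, a population of 28.5M, crawls lasting 2 to 7 days) are the values quoted in the text, and the daily rate is rounded to two decimals before scaling, which is how the quoted range appears to have been obtained. 

```python
def daily_growth_pct(new_users_per_day, population):
    """Daily population growth as a percentage of the current population."""
    return 100.0 * new_users_per_day / population


def crawl_growth_range(daily_pct, min_days, max_days):
    """Growth accumulated over the shortest and longest crawl, in percent."""
    return round(min_days * daily_pct, 2), round(max_days * daily_pct, 2)


daily = round(daily_growth_pct(25_000, 28.5e6), 2)   # rounds 0.0877... to 0.09
low, high = crawl_growth_range(daily, 2, 7)
print(f"{daily}%/day, {low}% to {high}% per crawl")
```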

5) Efficient Multigraph Sampling in Last.fm: To collect 
data from Last.fm, we use a combination of API calls and 
data scraping. Consider that we are sampling user u. For 
efficient implementation of multigraph sampling we proceed 
in two stages, as shown in Fig [TJf). 

In the first stage, we discover the graphs of user u, 
and u's degrees in them. In our study, we use the 
API calls user.getfriends, user.getneighbors, 
user.getpastevents, and user.getevents to collect 
the list of friends, neighbors, past events, and future events 
respectively. Due to the lack of an API call that lists the groups 
of a user, we use data scraping to collect the list of groups and 
the corresponding size of each group. We treat each individual 
group and event as a different graph in the multigraph. We 
also consider the sets of friends and neighbors to comprise the 
friends graph and the neighbors graph, respectively, in the 
multigraph. At the end of the first stage, we select one of the 
graphs G_i in accordance with Algorithm 1 in Section III. 

We should note that at the end of the first stage, we have not 
enumerated any user from any of the groups and events graphs. 
Each of these graphs can be quite large (up to tens of thousands 
of users) and, depending on the user, there can be many groups or 
events per user (up to thousands). On the other hand, we have 
enumerated users of friends and neighbors, since knowledge 
of neighborhood size is equivalent to enumeration¹ for these 
graphs. Overall, our two-stage approach saves bandwidth 
and time by avoiding the enumeration of users for graphs that 
we are not going to sample from at each iteration. 

In the second stage, we pick uniformly at random one of 
the nodes from the graph G_i selected at the end of the first 
stage. If the graph G_i is a graph of a group or an event, we 
need to implement this action carefully for it to be efficient. More 
specifically, we do not need to enumerate all group members 
or event attendants from a group or event graph. Instead, we 
can take advantage of the paging functionality that OSNs often 
provide and fetch only the page that contains the user 
selected uniformly at random. In our study, to fetch group 
members we use the API call group.getmembers, which 
returns 50 users per page. To fetch event attendants we use 
data scraping, which also returns 50 users per HTML page.² 
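
The two stages can be sketched as follows. This is an illustrative outline rather than the crawler used in the study: `choose_graph`, `pick_member`, and the `fetch_page` callback are hypothetical names, and the sketch assumes that a graph is selected with probability proportional to the user's degree in it. The page size of 50 is taken from the text; excluding the current user from the draw is omitted for brevity. 

```python
import random

PAGE_SIZE = 50  # users returned per page (from the text)


def choose_graph(degrees, rng=random):
    """Stage 1: pick one graph, weighted by the user's degree in each.

    `degrees` maps a graph name (e.g. "friends", "group:jazz") to the
    user's degree in that graph; only sizes are needed, not member lists.
    """
    names = [name for name, deg in degrees.items() if deg > 0]
    if not names:
        raise ValueError("user is an isolate in every graph")
    weights = [degrees[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def pick_member(size, fetch_page, rng=random):
    """Stage 2: pick one member uniformly, fetching a single page.

    `fetch_page(page_no)` is a hypothetical callback that returns the
    list of members shown on the zero-based page `page_no`.
    """
    index = rng.randrange(size)                 # uniform position in the listing
    page_no, offset = divmod(index, PAGE_SIZE)  # page holding that position
    return fetch_page(page_no)[offset]
```

With hypothetical degrees such as `{"friends": 10, "group:jazz": 999}`, `choose_graph` returns the group about 99% of the time, and `pick_member` then issues one page request instead of the 20 needed to list all 1000 members. 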

6) Data Collection: We used a cluster of machines to 
execute all crawl types under comparison simultaneously. For 
each crawl type, we run |V_0| = 5 different independent walks. 
The starting points for the five walks, in each crawl type, 
are randomly selected users identified by the web site as 
listeners of songs from each of five different music genres: 
country, hip hop, jazz, pop and rock. This set was chosen to 
provide an overdispersed seed set, while not relying on any 
special-purpose methods (e.g., UNI sampling). To ensure that 
differences in outcomes do not result from choice of seeds, 
the same seed users are used for all crawl types. We let each 
independent crawl continue until we determine convergence 
per walk and per crawl, using online diagnostics as introduced 
in [6] and described in Section III-D3. Eventually, we collected 
exactly 50K samples per walk for each random walk crawl type. 
Finally, we collect a UNI sample of 500K users. 

7) Summary of Collected Datasets: Table I summarizes the 
collected datasets. Each crawl type contains 5 x 50K = 250K 
users. We observe that there is a large number of repetitions 
in the random walks of each crawl type, ranging from 25% 
(in Friends-Events-Groups-Neighbors) to 47% (in Neighbors). 
This appears to stem from the high levels of clustering 
observed in the individual networks. It is also interesting to note 
that the crawl on the multigraph Friends-Events-Groups-Neighbors 
is able to reach more unique nodes than any of the 
single graph crawls. 

¹ There might be workarounds to enumerating friends but they are not 
necessarily more efficient. For example, we could extract the number of 
friends by data scraping. In general, in another setting we could do away 
with any kind of enumeration in the first stage. 

² We prefer data scraping to the API call event.getattendees because 
the API call (i) is not paged, (ii) does not return users that marked "maybe" for 
the event, and (iii) is very slow for large events. 

Crawl Type                        Friends   Events    Groups    Neighbors 
                                  Graph     Graphs    Graphs    Graphs 
Friends                           100%      0%        0%        0% 
Events                            0%        100%      0%        0% 
Groups                            0%        0%        100%      0% 
Neighbors                         0%        0%        0%        100% 
Friends-Events                    2.2%      97.8%     0%        0% 
Friends-Events-Groups             0.3%      5.4%      94.3%     0% 
Friends-Events-Groups-Neighbors   0.3%      5.5%      94.2%     0.02% 

TABLE II 

Percentage of time a particular graph (edges corresponding 
to this graph) is used during the crawl by Algorithm 1. 

Table II shows the fraction of Markov chain transitions 
using each individual relation. The results for the single-graph 
crawl types Friends, Events, Groups, and Neighbors are as 
expected: they use their own edges 100% of the time and 
other relations' 0%. Besides that, we see that Events relations 
dominate Friends when they are combined in a multigraph, 
and Groups dominate Friends, Events, and Neighbors when 
combined with them. This occurs because many groups and 
events are quite large (hundreds or thousands of users), leading 
participants to have very high relationship-specific degree 
for purposes of Algorithm 1, and thus the Group or 
Event relations are chosen more frequently than low-degree 
relations like Friends. In the crawl types obtained through a 
random walk, the highest overlap of users is observed between 
Groups and Friends-Events-Groups-Neighbors (66K), while 
the lowest is between Neighbors and Friends-Events (5K). 
It is noteworthy that despite the dominance of Groups and 
the high overlap between Groups and Friends-Events-Groups-Neighbors, 
the aggregates for these two crawl types in Table I 
point to very different samples of users. 

Fig. 8. Convergence diagnostic tests w.r.t. five different user properties 
("nfriends": number of friends, "ngroups": number of groups, "npast_events": 
number of past events, "nfuture_events": number of future events, and 
"subscriber") and three different crawl types (Friends, Groups, Neighbors). 
[Plot omitted; horizontal axis: Iterations.] 

B. Evaluation Results 

1) Convergence: Burn-in. To determine the burn-in for each 
crawl type in Table I, we run the Geweke diagnostic separately 
on each of its 5 chains, and the Gelman-Rubin diagnostic 
across all 5 chains at once, for several different properties 
of interest. The Geweke diagnostic shows that first-order 
convergence is achieved within each walk after approximately 
500 iterations at maximum. For the single relation crawl types 
(Friends, Events, Groups, and Neighbors), the Gelman-Rubin 
diagnostic indicates that convergence is attained within 1000 
iterations per walk (target value for R below 1.02 and close 
to 1), as shown in Fig. 8. 

Fig. 9. Convergence diagnostic tests w.r.t. five different user properties 
("nfriends": number of friends, "ngroups": number of groups, "npast_events": 
number of past events, "nfuture_events": number of future events, and 
"subscriber") and three different crawl types (Friends-Events, 
Friends-Events-Groups, Friends-Events-Groups-Neighbors). 
[Plot omitted; horizontal axis: Iterations.] 

On the other hand, multigraph crawl types take longer to 
reach equilibrium. Fig. 9 presents the Gelman-Rubin (R) score 
for three multigraph crawl types (namely Friends-Events, 
Friends-Events-Groups, Friends-Events-Groups-Neighbors) 
and five user properties (namely number of friends, number 
of groups, number of past/future events, and subscriber, 
a binary value which indicates whether the user is a paid 
customer). We observe that it takes 10K, 12.5K and 10K 
samples, respectively, for these crawl types to converge. 
However, as we show next, they include more isolated users 
and better reflect the ground truth, while the single graph 
sampling methods fail to do so. This underscores an important 
point regarding convergence diagnostics: while useful for 
determining whether a random walk sample approximates 
its equilibrium distribution, they cannot reliably identify 
cases in which the equilibrium itself is biased (e.g., due to 
non-connectivity). For the rest of the analysis, we discard, as 
burn-in, the samples each crawl type needed to converge. 
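
For concreteness, the two diagnostics can be computed as in the sketch below. It uses textbook forms rather than the exact implementation behind the figures: the Geweke score here compares the first 10% and last 50% of a walk using plain sample variances (the original diagnostic uses spectral variance estimates), and the Gelman-Rubin score is the standard potential scale reduction factor over walks of equal length. The function names and window fractions are assumptions. 

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke z-score for one walk (near 0 suggests convergence)."""
    a = chain[: int(first * len(chain))]          # early window
    b = chain[int((1 - last) * len(chain)):]      # late window
    spread = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / spread


def gelman_rubin(chains):
    """Potential scale reduction factor R across equal-length walks."""
    n = len(chains[0])
    within = mean(variance(c) for c in chains)           # W: mean within-walk variance
    between = n * variance([mean(c) for c in chains])    # B: between-walk variance
    pooled = (n - 1) / n * within + between / n          # pooled variance estimate
    return sqrt(pooled / within)
```

Each function would be applied to one scalar property at a time (e.g., the number of friends of the users visited by each walk), with R compared against a threshold such as the 1.02 used above. 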

Total Running Time. Before we analyze the collected 
datasets, we verify that the remaining walk samples, after 
discarding burn-in, have reached their stationary distribution. 
Table I contains the "Number of users kept" for each crawl 
type. We use the convergence diagnostics on the remaining 
samples to assess convergence formally. The results are 
qualitatively similar to those of the burn-in determination. We also 
perform visual inspection of the running means in Fig. 10 for 
four different properties, which reveals that the estimation of 
the average for each property stabilizes within 2-4K samples 
per walk (or 10-20K over all 5 walks). 
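
The running means inspected here can be computed incrementally. The sketch below assumes the standard re-weighting for random walk samples, in which each sampled user is weighted by the inverse of its degree in the graph being walked (for a multigraph, the total degree over all relations); the function name and the `(value, degree)` input format are illustrative. 

```python
def running_reweighted_mean(samples):
    """Yield the re-weighted mean after each sample.

    `samples` is an iterable of (value, degree) pairs in walk order,
    where `degree` is the sampled user's (positive) degree.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, degree in samples:
        weight = 1.0 / degree      # correct the bias toward high-degree users
        weighted_sum += weight * value
        weight_total += weight
        yield weighted_sum / weight_total
```

Plotting the yielded values against the iteration number gives a running-mean curve; a flat tail suggests that the estimate has stabilized. 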



Crawl Type                        Friends       Future Events  Past Events   Groups 
                                  Isolates      Isolates       Isolates      Isolates 
Friends                           [illegible]   [illegible]    [illegible]   [illegible] 
Events                            [illegible]   [illegible]    [illegible]   [illegible] 
Groups                            21.2%         89.9%          62.0%         0.0% 
Neighbors                         40.4%         89.5%          71.2%         62.4% 
Friends-Events                    6.2%          93.5%          69.9%         61.6% 
Friends-Events-Groups             5.5%          98.15%         88.1%         85.3% 
Friends-Events-Groups-Neighbors   7.4%          98.3%          86.7%         86.3% 
UNI                               87.9%         99.2%          96.1%         93.8% 

TABLE III 

Percentage of sampled nodes that are isolates (have degree 0) 
w.r.t. a particular (multi)graph. 



2) Discovering Isolated Components: As noted above, part 
of our motivation for sampling Last.fm using multigraph 
methods stems from its status as a fragmented network with 
a rich multigraph structure. In particular, we expected that 
large parts of the user base would not be reachable from the 
largest connected component in any one graph. Such users 
could consist of either isolated individuals or highly clustered 
sets of users lacking ties to the rest of the network. Here we call 
an isolate any user that has degree 0 in a particular graph relation. 
Walk-based sampling on that particular graph relation has no 
way of reaching those isolates, but a combination of graphs 
might be able to reach them, assuming that a typical user 
participates in different ways in the network (e.g., a user with 
no friends may still belong to a group or attend an event). 

In Table III, we report the percentage of nodes in each 
crawl type that are estimated to be isolates, and compare 
this percentage to the UNI sample. Observe that there is an 
extremely high percentage of isolate users in any single graph: 
e.g., UNI samples are 88% isolates in the Friends relation, 
96-99% isolates in the Events relation, and 93.8% isolates in the 
Groups relation. Such isolates are not necessarily inactive: for 
instance, 59% of users without friends have either a positive 
playcount or playlist value, which means that they have played 
music (or recorded their offline playlists) in Last.fm, and 
hence are or have been active users of the site. This confirms 
our expectation that Last.fm is indeed a fragmented graph. 

More importantly, Table III allows us to assess how well 
different crawl types estimate the % of users that are isolates 
with respect to a particular relation or set of relations. We 
observe that the multigraph that includes all relations 
(Friends-Events-Groups-Neighbors) leads to the best estimate of the 
ground truth (UNI sample, shown in the last row). The 
only exception is the friends isolates, where the single graph 
Neighbors gives a better estimate of the percentage of isolates 
than all other crawl types. The multigraph crawl type 
Friends-Events-Groups-Neighbors uses the Neighbors relation only 
0.02% of the time, and thus does not benefit as much as 
might be expected (though see below). A weighted random 
walk that put more emphasis on this relation (or use of a 
relation that is less sparse) could potentially improve performance 
in this respect. 

Fig. 10. Single graph vs multigraph sampling. Sample mean over the number of iterations, for four user properties (number of friends, number of groups, 
number of past events, % subscribers), as estimated by different crawl types. 



3) Comparing Samples to Ground Truth: 
I. Comparing to UNI. In Table III, we saw that multigraph 
sampling was able to better approximate the percentage of 
isolates in the population. Here we consider other user 
properties (namely number of friends, past events, and groups a 
user belongs to, and whether he or she is a subscriber to 
Last.fm). In Fig. 10, we plot the sample mean value for 
four user properties across iteration number, for all crawl 
types and for the ground truth (UNI). One can see that 
crawling on a single graph, i.e., Friends, Events, Groups, or 
Neighbors alone, leads to poor estimates. This is prefigured 
by the previous results, as single graph crawls undersample 
individuals such as isolates on their corresponding relation, 
who form a large portion of the population. We also notice 
that Events and Groups alone consistently overestimate the 
averages, as these tend to cover the most active portion of 
the user base. However, combining them with other 
relations helps considerably. The multigraph that utilizes all 
relations, Friends-Events-Groups-Neighbors, is the closest to 
the truth. For example, it approximates very closely the average 
number of groups and the % of paid subscribers (Figs. 10(b) 
and 10(d)). 

In Fig. 11, we plot the probability distributions for four user 
properties of interest. Again, the crawl type 
Friends-Events-Groups-Neighbors is closest to the ground truth, in terms of 
shape of the distribution and vertical distance to it. Nevertheless, 
we observe that in both the probability distribution and 
the running mean plots, there is sometimes a gap from UNI, 
which is caused by the imperfect approximation of the % of 
isolates. That is the reason that the gap is the largest for the 
number of friends property (Figs. 10(a) and 11(a)). 



II. Comparing to Weekly Charts. Finally, we compare the 
estimates obtained by different crawl types to a different 
source of ground truth: the weekly charts posted by 
Last.fm. This is useful as an example of how one can 
(at least approximately) validate the representativeness of a 
random walk sample in the absence of a known uniform 
reference sample. 

Fig. 11. Single graph vs multigraph sampling. Probability distribution function (pdf) for three user properties (number of friends, number of groups, number 
of past events), as estimated by different crawl types. 

(b) Popularity of tracks 

Fig. 12. Weekly Charts for the week 07/04-07/11. Artists/tracks are retrieved 
from "Last.fm Charts" and remain the same for all crawl types. Data is 
linearly binned (30 points). Inset: Artist/track popularity rank and percentage 
of listeners for all artists/tracks encountered in each crawl type. 

Last.fm reports on its website weekly music charts and 
statistics, generated automatically from user activity. To 
mention a few examples, "Weekly Top Artists" and "Weekly Top 
Tracks" as well as "Top Tags" and "Loved Tracks" are reported. 
Each chart is based on the actual number of people listening 
to the track, album or artist, recorded either through an 
Audioscrobbler plug-in (a free tracking service provided by the site) 
or the Last.fm radio stream. To validate the performance 
of multigraph sampling, we estimate the charts of "Weekly 
Top Artists" and "Weekly Top Tracks" from our sample of 
users for each of the crawl types in Table I, and we compare 
them to the published charts for the week of July 4 to July 11, 2010, 
i.e., the week just before the crawling started. To generate the 
charts from our user samples, we utilize API functions that 
allow us to fetch the exact list of artists and tracks that a user 
listened to during a given date range. Fig. 12 shows the observed 
artist/track popularity rank and the percentage of listeners for 
the top 420 tracks/artists (the maximum available) from the 
Last.fm Charts, with the estimated ranks and percentage 
of listeners for the same tracks/artists in each crawl type. As 
can be seen, the rank curve estimated from the multigraph 
Friends-Events-Groups-Neighbors tracks the actual 
rank curve quite well. Additionally, the curve that corresponds to the UNI 
sample lies virtually on top of the "Last.fm Charts" 
line. On the other hand, the single graph crawl types Friends, 
Events, Groups, and Neighbors are quite far from the actual charts. 
Here, as elsewhere, combining multiple relations gets us much 
closer to the truth than would reliance on a single graph. 
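
A chart can be estimated from a user sample roughly as follows. This is a sketch under assumptions, not the procedure used for the figures: each sampled user is represented by a hypothetical `(weight, artists)` pair, where `artists` is the set of artists the user listened to in the chart week and `weight` is the user's re-weighting factor (1 for every user of a uniform sample). 

```python
from collections import defaultdict


def estimate_chart(users, top=None):
    """Rank artists by the estimated percentage of listeners.

    `users` is a list of (weight, artists) pairs; returns a list of
    (artist, percent_listeners) sorted from most to least popular.
    """
    total = sum(weight for weight, _ in users)
    listened = defaultdict(float)
    for weight, artists in users:
        for artist in set(artists):      # count each user once per artist
            listened[artist] += weight
    chart = [(a, 100.0 * w / total) for a, w in listened.items()]
    chart.sort(key=lambda item: (-item[1], item[0]))   # ties broken by name
    return chart[:top]
```

The position of an artist in the returned list is its estimated rank, which can then be compared with the rank in the published chart. 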



V. Related Work 

Early graph exploration methods that were used to measure 
OSNs were based on BFS and snowball sampling [1]-[3]. 
These methods have been shown to have a generally unknown 
bias towards high degree nodes when far from completion. In 
our recent and ongoing work, we attempt to correct for this 
bias [5,26]; however, BFS is out of the scope of this paper. 
Recent work in [6,7,27] used random walks (where the bias is 
known) to sample users in OSNs, namely Friendster, Twitter 
and Facebook. Random walks have also been used to sample 
peer-to-peer networks [28]-[30] and other large graphs [31]. 

Random walk designs that aim to improve mixing 
include [18,32]-[34]. Boyd et al. [18] pose the problem of 
finding the fastest mixing Markov Chain on a known graph 
as an optimization problem. However, in our case such an 
exact optimization is not possible since we are exploring 
an unknown graph. Ribeiro et al. [32] introduce Frontier 
sampling and explore multiple dependent random walks to 
improve sampling in disconnected or loosely connected 
subgraphs. Multigraph sampling has the same goal but instead 
achieves it by exploring the social graph using multiple 
relations. Therefore, Frontier sampling is an orthogonal idea, 
which can potentially be combined with multigraph sampling 
for additional benefits. Multigraph sampling is also remotely 
related to techniques in the MCMC literature (e.g., 
Metropolis-coupled MCMC or simulated tempering [33]) that seek to 
improve Markov chain convergence by mixing states across 
multiple chains with distinct stationary distributions. In [34,35] 
Thompson et al. introduce a family of adaptive cluster 
sampling (ACS) schemes, which are designed to explore nodes 
that satisfy some condition of interest; although random walk 
sampling is distinct from cluster sampling, the former does fit 
more broadly within the area of adaptive designs. 

As noted in Section IV-A4, we consider that the network 
of interest remains static for the duration of the crawl. We 
confirmed that this is a good approximation in the case of 
Last.fm by comparing two snapshots taken one week apart. 
Therefore, in this work, we do not consider dynamics, which 
are essential in other contexts [7,24,25]. 

Recent data collection studies of Last.fm include: [36], 
which develops a track recommendation system using social 
tags and friendship between users, [37], which examines user 
similarity to predict social links, and [38], which explores the 
meaning of friendship in Last.fm through survey sampling. 
We emphasize that a representative sample 
is crucial to the usefulness of such datasets. 

In our previous work [6] and its extended version [10], 
we proposed a framework for crawling a single graph. In 
the implementation part of this paper, we adopt some of the 
practical recommendations of that work (e.g., the use of the 
RWRW as the preferred crawling technique, the use of online 
convergence diagnostics, etc). However, our focus here is on 
comparing multigraph sampling vs. single graph sampling, and 
on demonstrating its utility on fragmented networks such as 
Last.fm. To the best of our knowledge, our work is the 
first to explore sampling OSNs on a combination of multiple 
relations. 

VI. Conclusion 

In this paper, we have introduced multigraph sampling, 
a novel technique for random walk sampling of OSNs using 
multiple underlying relations. Multigraph sampling generates 
probability samples in the same manner as conventional 
random walk methods, but is more robust to poor connectivity 
and clustering within individual relations. As we demonstrate 
using the Last.fm service, multigraph methods can give 
reasonable approximations to uniform sampling even where 
the overwhelming majority of users in each underlying 
relation are isolates, a setting in which single-graph methods fail. Our 
experiments with synthetic graphs also suggest that multigraph 
sampling can improve the coverage and the convergence time 
for partitioned or highly clustered networks. Given these 
advantages, we believe multigraph sampling to be a useful 
addition to the growing suite of methods for sampling OSNs. 

The focus of this paper was on (i) demonstrating the utility 
of multigraph sampling compared to single graph sampling 
and (ii) the design of an efficient two-stage algorithm that 
implements the idea. 

Open questions include the selection of a few relations, out of many 
candidates, to use when sampling, so as to optimize 
the multigraph sampler performance. Intuitively, we expect 
that negatively correlated relations will prove most effective. 
A related question is the weighting of the different relations 
for the same purpose. Gaining intuition into these problems 
will be particularly helpful in designing optimal OSN sampling 
schemes and is a direction for future work. 